[8.7.0] Add Bazel support for --rewind_lost_inputs #28971

Open
fmeum wants to merge 3 commits into bazelbuild:release-8.7.0 from fmeum:cherry-pick-rewind-lost-inputs-8.7.0

@fmeum fmeum commented Mar 12, 2026

Background

As of #25396, action rewinding (controlled by --rewind_lost_inputs) and build rewinding (controlled by --experimental_remote_cache_eviction_retries) are equally effective at recovering lost inputs.
However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if --jobs=1, as discovered in #25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues:

  • When a lost input is detected, the progress of actions running concurrently isn't lost.
  • Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability.
  • Finding a good value for the allowed number of build retries is difficult: a single input may be lost multiple times and rewinding can discover additional lost inputs, but at the same time, builds that ultimately fail shouldn't be retried indefinitely.
  • Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues.

Changes

This PR adds Bazel support for --rewind_lost_inputs with arbitrary --jobs values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically.

Synchronization is achieved by adding try-with-resources scopes backed by a new RewoundActionSynchronizer interface to SkyframeActionExecutor that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (--remote_cache_async).
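
To illustrate the scoping idea, here is a minimal sketch of a synchronizer that hands out try-with-resources scopes. It is not the code in this PR: the class name `ScopedSynchronizerSketch`, the `Scope` interface, and all method names are hypothetical, and a single `ReentrantReadWriteLock` stands in for the real locking scheme.

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch; names and structure are assumptions, not Bazel's RewoundActionSynchronizer.
final class ScopedSynchronizerSketch {
  /** A scope whose close() throws no checked exception, so it fits try-with-resources cleanly. */
  interface Scope extends AutoCloseable {
    @Override
    void close();
  }

  private final ReadWriteLock lock = new ReentrantReadWriteLock();

  /** Shared scope: any number of regular actions may run and read inputs at the same time. */
  Scope enterExecution() {
    lock.readLock().lock();
    return () -> lock.readLock().unlock();
  }

  /** Exclusive scope: a rewound action deleting its outputs excludes all readers. */
  Scope enterRewoundPreparation() {
    lock.writeLock().lock();
    return () -> lock.writeLock().unlock();
  }

  /** Runs a regular action body inside the shared scope. */
  void runAction(Runnable body) {
    try (Scope ignored = enterExecution()) {
      body.run();
    }
  }

  /** Runs the output-deleting preparation of a rewound action inside the exclusive scope. */
  void runRewoundPreparation(Runnable body) {
    try (Scope ignored = enterRewoundPreparation()) {
      body.run();
    }
  }

  /** Returns true if the calling thread could start a rewound preparation right now. */
  boolean canPrepareRewoundNow() {
    if (lock.writeLock().tryLock()) {
      lock.writeLock().unlock();
      return true;
    }
    return false;
  }
}
```

Because the lock is released in `close()`, the scope is left on every exit path, including exceptions thrown by the action body.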

The synchronization scheme relies on a single ReadWriteLock that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in RemoteRewoundActionSynchronizer for details as well as a proof that this scheme is free of deadlocks.
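
The inflation step can be sketched roughly as follows. This is a heavily simplified illustration under stated assumptions, not the algorithm in `RemoteRewoundActionSynchronizer`: actions are identified by plain strings, a rewound action only locks its own entry, scopes are never nested on one thread, and every name below is hypothetical. Refer to the comment in the real class for the actual scheme and its deadlock-freedom proof.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of "one lock until the first rewind, then a map of locks".
final class InflatingSynchronizerSketch {
  interface Scope extends AutoCloseable {
    @Override
    void close();
  }

  private final ReadWriteLock coarse = new ReentrantReadWriteLock();
  private final ConcurrentHashMap<String, ReadWriteLock> perAction = new ConcurrentHashMap<>();
  private volatile boolean inflated = false;

  boolean isInflated() {
    return inflated;
  }

  /** Scope for a regular action that reads outputs of the given producer actions. */
  Scope enterExecution(List<String> producers) {
    coarse.readLock().lock();
    if (!inflated) {
      // Fast path: nothing has been rewound yet, so the shared read lock is all we hold.
      return () -> coarse.readLock().unlock();
    }
    coarse.readLock().unlock();
    // Slow path: read-lock each producer, acquiring in a consistent (sorted) order.
    List<Lock> held = new ArrayList<>();
    for (String producer : new TreeSet<>(producers)) {
      Lock readLock = lockFor(producer).readLock();
      readLock.lock();
      held.add(readLock);
    }
    return () -> {
      for (int i = held.size() - 1; i >= 0; i--) {
        held.get(i).unlock();
      }
    };
  }

  /** Scope for a rewound action while it deletes and regenerates its outputs. */
  Scope enterRewound(String action) {
    if (!inflated) {
      // Taking the write lock waits for all fast-path readers before switching modes.
      coarse.writeLock().lock();
      inflated = true;
      coarse.writeLock().unlock();
    }
    Lock writeLock = lockFor(action).writeLock();
    writeLock.lock();
    return writeLock::unlock;
  }

  private ReadWriteLock lockFor(String action) {
    return perAction.computeIfAbsent(action, key -> new ReentrantReadWriteLock());
  }
}
```

In this sketch, builds without lost inputs only ever take the shared read lock, which mirrors the stated goal of not regressing the common case. After inflation, a consumer blocks only on the rewound actions whose outputs it reads.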


Subsumes the previously reviewed #25412, which couldn't be merged due to the lack of synchronization.

Tested for races manually by running the following command (also with ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10):

bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled

Fixes #26657

RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions.

Includes the following cherry-picked changes:

  • 2be693e
  • 464eacb
  • a small fix in RewindingTestHelpers that normalizes line endings to \n before asserting on content, which is necessary for tests to pass on Windows
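
The line-ending normalization mentioned in the last item might look like the following. This is a guess at the shape of the fix, not the actual code in RewindingTestHelpers; the class and method names are hypothetical.

```java
// Hypothetical helper; the real fix in RewindingTestHelpers may differ.
final class LineEndingSketch {
  /** Converts Windows-style CRLF line endings to LF so content assertions are platform-independent. */
  static String normalizeLineEndings(String content) {
    return content.replace("\r\n", "\n");
  }
}
```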

@fmeum fmeum force-pushed the cherry-pick-rewind-lost-inputs-8.7.0 branch 5 times, most recently from f2ff9ba to f3f28f2 Compare March 13, 2026 14:13
fmeum added 2 commits March 13, 2026 15:20
As of bazelbuild#25396, action rewinding (controlled by `--rewind_lost_inputs`) and build rewinding (controlled by `--experimental_remote_cache_eviction_retries`) are equally effective at recovering lost inputs.
However, action rewinding in Bazel is prone to races, which renders it unusable in practice - in fact, there are races even if `--jobs=1`, as discovered in bazelbuild#25412. It does have a number of benefits compared to build rewinding, which makes it worth fixing these issues:
* When a lost input is detected, the progress of actions running concurrently isn't lost.
* Build rewinding can start a large number of invocations with their own build lifecycle, which greatly complicates build observability.
* Finding a good value for the allowed number of build retries is difficult: a single input may be lost multiple times and rewinding can discover additional lost inputs, but at the same time, builds that ultimately fail shouldn't be retried indefinitely.
* Build rewinding drops all action cache entries that mention remote files when it encounters a lost input, which can compound remote cache issues.

This PR adds Bazel support for `--rewind_lost_inputs` with arbitrary `--jobs` values by synchronizing action preparation, execution and post-processing in the presence of rewound actions. This is necessary with Bazel's remote filesystem since it is backed by the local filesystem and needs to support local execution of actions, whereas Blaze uses a content-addressed filesystem that can be updated atomically.

Synchronization is achieved by adding try-with-resources scopes backed by a new `RewoundActionSynchronizer` interface to `SkyframeActionExecutor` that wrap action preparation (which primarily deletes action outputs) and action execution, thus preventing a rewound action from deleting its outputs while downstream actions read them concurrently. Additional synchronization is required to handle async remote cache uploads (`--remote_cache_async`).

The synchronization scheme relies on a single `ReadWriteLock` that is only ever locked for reading until the first time an action is rewound, which ensures that performance doesn't regress for the common case of builds without lost inputs. Upon the first time an action is rewound, the single lock is inflated to a concurrent map of locks that permits concurrency between actions as long as dependency relations between rewound and non-rewound actions are honored (i.e., an action consuming a non-lost input of a rewound action can't execute concurrently with that action's preparation and execution). See the comment in `RemoteRewoundActionSynchronizer` for details as well as a proof that this scheme is free of deadlocks.
________

Subsumes the previously reviewed bazelbuild#25412, which couldn't be merged due to the lack of synchronization.

Tested for races manually by running the following command (also with `ActionRewindStrategy.MAX_ACTION_REWIND_EVENTS = 10`):
```
bazel test //src/test/java/com/google/devtools/build/lib/skyframe/rewinding:RewindingTest --test_filter=com.google.devtools.build.lib.skyframe.rewinding.RewindingTest#multipleLostInputsForRewindPlan --runs_per_test=1000 --runs_per_test_detects_flakes --test_sharding_strategy=disabled
```

Fixes bazelbuild#26657

RELNOTES: Bazel now has experimental support for --rewind_lost_inputs, which can rerun actions within a single build to recover from (remote or disk) cache evictions.

Closes bazelbuild#25477.

PiperOrigin-RevId: 882050264
Change-Id: I79b7d22bdb83224088a34be62c492a966e9be132
(cherry picked from commit 464eacb)
This ensures that the journal file is not kept open after a build, which has been observed to cause `RewindingTest` to fail on Windows, which disallows deleting open files (in this case during test case cleanup).

Since the journal is written out at most every 3 seconds for all current usages of the `PersistentMap` class, the overhead of the additional `open` is negligible.
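
As an illustration of the open-per-write pattern described above, here is a minimal sketch that opens the journal in append mode for each write and closes it right away. It is not the `PersistentMap` implementation; the class `JournalSketch` and its `append` method are hypothetical.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch; not the actual PersistentMap journal code.
final class JournalSketch {
  private final Path journal;

  JournalSketch(Path journal) {
    this.journal = journal;
  }

  /** Appends one entry, holding the file open only for the duration of the write. */
  void append(String entry) throws IOException {
    try (OutputStream out =
        Files.newOutputStream(journal, StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
      out.write((entry + "\n").getBytes(StandardCharsets.UTF_8));
    }
  }
}
```

Since no handle outlives `append`, the file can be deleted between writes, which is the property the Windows test cleanup relies on.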

Also include small tweaks to `RewindingTest` so that it can be enabled on Windows. In particular, the order of `SymlinkAction` and `SourceManifestAction` being rewound doesn't seem to be fixed and can differ on Windows.

Closes bazelbuild#28108.

PiperOrigin-RevId: 855242645
Change-Id: Ic7434139b290c6b0e2061977f747e890c3c5ece6
(cherry picked from commit 2be693e)
@fmeum fmeum force-pushed the cherry-pick-rewind-lost-inputs-8.7.0 branch from f3f28f2 to 0c16de2 Compare March 13, 2026 14:20
@fmeum fmeum marked this pull request as ready for review March 15, 2026 11:03
@fmeum fmeum requested a review from a team as a code owner March 15, 2026 11:03
@fmeum fmeum requested a review from coeuvre March 15, 2026 11:03
@github-actions github-actions bot added the labels team-Remote-Exec (Issues and PRs for the Execution (Remote) team) and awaiting-review (PR is awaiting review from an assigned reviewer) Mar 15, 2026
@iancha1992 iancha1992 enabled auto-merge March 16, 2026 18:24